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Abstract 

A key quantity of interest in Bayesian infer¬ 
ence are expectations of functions with re¬ 
spect to a posterior distribution. Markov 
Chain Monte Carlo is a fundamental tool to 
consistently compute these expectations via 
averaging samples drawn from an approxi¬ 
mate posterior. However, its feasibility is 
being challenged in the era of so called Big 
Data as all data needs to be processed in ev¬ 
ery iteration. Realising that such simulation 
is an unnecessarily hard problem if the goal 
is estimation , we construct a computation¬ 
ally scalable methodology that allows unbi¬ 
ased estimation of the required expectations 
- without explicit simulation from the full 
posterior. The scheme’s variance is finite by 
construction and straightforward to control, 
leading to algorithms that are provably un¬ 
biased and naturally arrive at a desired er¬ 
ror tolerance. This is achieved at an average 
computational complexity that is sub-linear 
in the size of the dataset and its free pa¬ 
rameters are easy to tune. We demonstrate 
the utility and generality of the methodology 
on a range of common statistical models ap¬ 
plied to large-scale benchmark and real-world 
datasets. 

1. Introduction 

Markov Chain Monte Carlo (MCMC), used for sam¬ 
pling from posterior distributions, is one of the most 
fundamental tools in Bayesian data analysis. How¬ 
ever, the recent explosion in the amount of data to 
be analysed poses serious challenges for this method¬ 
ology as it is often infeasible to scale it to today’s sta- 
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tistical problems. This development led to a recent 
focus on methods to make MCMC practical for large 
datasets. Most research thus far has focused on de¬ 
vising alternative Markov transition kernels based on 
mini-batches of the data. These approaches either lead 
to (1) samples being drawn from an asymptotically 
approximate posterior distribution, thus reducing the 
amount of computation at the expense of introducing 
bias (Welling and Teh, 2011; Korattikara et ah, 2014; 
Chen et ah, 2014; Bardenet et ah, 2014), or (2) pre¬ 
serving the asymptotically correct invariant distribu¬ 
tion at the expense of technical requirements and mix¬ 
ing properties that might limit applicability in prac¬ 
tice (Maclaurin and Adams, 2014). The alternative 
approach is to run MCMC on small shards of the data 
and then construct a ’Consensus Monte Carlo’ esti¬ 
mator (Scott et al., 2013). At present the Consensus 
Monte Carlo algorithm lacks any theoretical guaran¬ 
tees. 

In this contribution, we propose a different view on 
the problem. We construct a scheme that directly es¬ 
timates posterior expectations with neither simulation 
from the full posterior nor construction of alternative 
approximate transition kernels - without introducing 
bias. While variance of these estimators is naturally 
increased, we will show that this increase is bounded 
by construction and straightforward to control. This 
in particular holds for the Big Data context. We ar¬ 
rive at the following desiderata for a useful methodol¬ 
ogy in unbiased Bayesian inference: (i) computational 
complexity that is sub-linear in the number of observa¬ 
tions, (ii) controllable and bounded variance, and (iii) 
no problems with transition kernel design. 

The presented framework is very general in the sense 
that it is neither restricted to a particular form of the 
underlying Bayesian model (such as factorising like¬ 
lihoods), nor does it rely on a particular inference 
technique used from within. Any free parameters are 
easy to tune. Furthermore, it is competitive in prac¬ 
tice: experimental examples show that we are able 
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to accurately and confidently estimate posterior ex¬ 
pectations, faster than with simulation methods. In 
addition, and without exploiting any domain specific 
structure, we are able to outperform state-of-the-art 
results obtained from (approximate) stochastic varia¬ 
tional inference methods on large-scale real-world data 
(Hensman et al., 2013). 

By no means do we aim to replace existing simulation 
(or any other) techniques for Bayesian inference. If the 
goal is to obtain a representation of the full posterior 
density, simulation remains the method of choice. Our 
contribution rather provides a different perspective on 
problems where a given Bayesian expectation lies at 
the core of the performed statistical analysis. 

The setting we consider is as follows. Given data 
V = {xi,..., xn}, a statistical model with parameters 
9 £ © C W, and likelihood p(x 1 , . .. ,Xn\8), we wish 
to obtain an unbiased estimator of the expectation 

E™M0)} (!) 

of a given functional p : O —> R over the full posterior 
7 tjv = p(8\xi, ... ,xn) oc p(9)p(xi,.. . ,xn\8)- While 
we focus on the real-valued p here, multivariate ex¬ 
tensions are possible. A typical way to approach this 
problem is to generate samples {0i} i=1 from ttn us¬ 
ing MCMC, and then compute the empirical expecta¬ 
tion E 7rjv {^(0)} = jj- ¥>(0*)- Note that the goal 
here is not to obtain samples from the full posterior 
measure tin ~ the focus is rather on the estimation of 
particular expectations. For example, in a regression 
setting, we might be interested in predictive posterior 
means and variances. This is the ubiquitous end goal 
in many situations in which Bayesian inference is em¬ 
ployed. Therefore, we challenge the paradigm of solely 
working towards posterior simulation for such estima¬ 
tion problems, and propose a complementary method¬ 
ology. 

A subtlety when dealing with MCMC based estimates 
is that E,r{(^(0)} for any posterior 7r is only asymp¬ 
totically correct. Therefore, any finite time MCMC 
algorithm produces estimates that contain a system¬ 
atic bias, which is subsequently made arbitrarily small 
via careful choice of simulation parameters. Parts of 
our methodology are based on such estimators. For 
the sake of simplicity, we will here assume access to 
the asymptotic limit in the same sense as finite time 
MCMC estimates are treated as ’correct’. We hint at 
a way to address this issue as an outlook. 

Moreover, our approach is not restricted to MCMC, 
but easily applies to situations where expectations 
over 7 rjv are available in closed form but require pro¬ 
hibitive amounts of computation. Such cases can 


for example be found in large-scale spatial statistics 
(Lyne et ah, 2013), or Gaussian Process models de¬ 
ployed to Big Data regimes (Rasmussen and Williams, 
2006; Hensman et ah, 2013). 

Our work builds on several breakthroughs in re¬ 
lated areas, such as unbiased estimation for stochas¬ 
tic differential equations (Rhee and Glynn, 2013) 
and for Markov chain equilibrium expectations 
(Glynn and Rhee, 2014). These developments demon¬ 
strate the overarching principle that estimation is of¬ 
ten an easier problem than simulation - a dictum we 
adopt and apply here in the context of Bayesian infer¬ 
ence. 

Paper outline We begin by summarising previous 
sub-sampling based simulation approaches in Section 
2. In Section 3, we construct unbiased estimators for 
Bayesian expectations from paths of partial posterior 
distributions. Section 4 contains a number of experi¬ 
mental illustrations where we demonstrate that our es¬ 
timator is in particular useful in the Big Data regime. 
In Section 5, we point out a number of extensions and 
conduct experiments that showcase the generality of 
the developed framework. We close with a discussion 
of shortcomings and point out future work in Section 
6 . 

2. Previous work 

A practical difficulty in dealing with the full posterior 
7tjv = p{8\xi, ■ ■ ■ , xn) oc p(Q)p{xi,... , xn\8) is that 
N is often large. This renders the computation of 
a likelihood p{x\,... ,xn\9) extremely expensive - if 
not impossible. This, for example, limits feasibility of 
MCMC algorithms to simulate from 7rjv as they require 
access to p(x i,..., Xjy\ 9) in every iteration. 

Biased MCMC The infeasibility of exact like¬ 
lihood computation has been the main focus 
of (Welling and Teh, 2011; Korattikara et al., 2014; 
Chen et al., 2014; Bardenet et al., 2014) where this is¬ 
sue is circumvented by approximations to the tran¬ 
sition kernel in MCMC. This is done using, e.g. 
stochastic gradient Langevin (Welling and Teh, 2011) 
or Hamiltonian (Chen et al., 2014) approaches, or by 
using a noisy acceptance ratio and employing a sta¬ 
tistical test (Korattikara et al., 2014) or concentration 
bounds (Bardenet et al., 2014). The well-known issue 
with this vein of work is that the approximate finite 
step-size diffusions, defined by mini-batches of data, 
are no longer corrected for induced bias. Consequently 
convergence to the correct posterior (and indeed any) 
measure is no longer guaranteed. The practical effect 
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of these approaches makes them difficult to tune and 
to obtain a well-mixing chain. Furthermore, artefacts 
of these methodologies, such as using parametric hy¬ 
pothesis testing (Korattikara et al., 2014), might even 
lead to over-confident accept/reject decisions in the 
Markov transition kernel. The latter was illustrated 
in Bardenet et al. (2014), who also substantially im¬ 
prove on these constructions by providing total varia¬ 
tion bounds to assess the quality of the approximation. 

Noisy MCMC Recently, Alquier et al. (2014) pro¬ 
vided quantification of the approximation quality of 
many ’noisy’ MCMC Algorithms, including the ones 
by Welling and Teh (2011); Korattikara et al. (2014); 
Bardenet et al. (2014). These results are an impor¬ 
tant step towards understanding the extent of the 
bias induced by employing approximate transition ker¬ 
nels. However, in practice their results require uni¬ 
form or geometric ergodicity of the original Markov 
Chain to explicitly quantify the approximation error 
(Alquier et al., 2014, Theorem 2.1) or just guarantee 
convergence (Alquier et al., 2014, Theorem 2.2), re¬ 
spectively. The first condition is too strong for most 
problems in practice while the second one does not 
give important details on how and when the approx¬ 
imate chain converges. Our work can be seen as an 
orthogonal approach, as we avoid simulation from the 
full posterior and rather directly attack the underlying 
estimation of expectations of interest, i.e. (1). 

Firefly MCMC In contrast to these approxi¬ 
mate, biased sampling methods, Firefly MCMC 
(Maclaurin and Adams, 2014) introduces an exact 
construction that neither introduces bias nor requires 
computation of a full likelihood. It is an elegant way 
of exploiting additional auxiliary variable structure. In 
this regime of computationally intractable likelihoods 
due to data size 1 , it is seen as the only approach that 
can ensure coherence of subsequent inference through 
the simulation from the asymptotically correct pos¬ 
terior. One complication with Firefly MCMC is that 
it requires availability of appropriate easily computable 
and tight lower bounds on the likelihood function, tun¬ 
ing of which requires at least one sweep through all 
data. Of course, the models for which such bounds can 
be obtained are often precisely those relatively simple 
models where the need for full and exact Bayesian in¬ 
ference over variational or other approximations might 
be questionable. While investigating the formal con¬ 
struction of such bounds in more general model classes 
is a promising way forward, the generality and appli- 

x Data size as opposed to computationally intractable 
likelihoods due to an inherently intractable functional 
defining the likelihood. 


cability of Firefly MCMC is clearly limited. More¬ 
over, while Firefly MCMC can significantly reduce the 
number of likelihood evaluations at each iteration of 
MCMC, the complexity of the scheme is linear in the 
number of observations, as resampling of q ■ N auxil¬ 
iary variables is required at each iteration, for a given 
fraction of the available data q £ (0,1). There is a 
limit as to how small the parameter q can be chosen: 
mixing time cannot be smaller than 1 /q. This means 
that the reduced number of likelihood evaluations at 
each iteration of MCMC comes at the cost of requiring 
to run the chains by a factor of 1/q longer. 

In contrast to all but one of the previous sub-sampling 
schemes considered, the estimators that we propose 
are provably unbiased and also have a sub-linear av¬ 
erage complexity in the number of observations. Un¬ 
like the only unbiased competitor, Firefly MCMC, our 
approach does not require a lower bound on the like¬ 
lihood and even extends to several other situations: 
where posterior expectations are available in closed 
form but computationally infeasible, where likelihoods 
need not factorise across observations, and where like¬ 
lihoods might themselves be unavailable in an analytic 
form, as for example in the context of pseudo-marginal 
MCMC (Andrieu and Roberts, 2009). 

3. Partial posterior path estimators 

In this section, we present a different approach to 
coherent Bayesian inference in the Big Data regime 
which exploits the paths of induced partial poste¬ 
rior distributions through the debiasing device de¬ 
veloped in Rhee and Glynn (2013); Glynn and Rhee 
(2014). A similar approach was very recently taken by 
Agapiou et al. (2014), who exploit Rhee and Glynn’s 
work for unbiased posterior estimation of expectations 
over intractable infinite-dimensional models which can 
be parametrised in terms of a series expansion of ba¬ 
sis functions. In contrast, our work directly attacks 
intractability that arises from large datasets. We see 
our contribution as a pragmatic complement to exist¬ 
ing work on debiasing Monte Carlo estimates. 

Our approach follows similar ideas as Chopin (2002), 
who presented a sequential Monte Carlo procedure 
for static target distributions by exploiting a sequence 
of partial posterior targets. Given the full posterior 
TT N = p{0\xi,...,x N ) oc p(xi,...,x N \0)p(0), we de¬ 
fine a subset V t = {xi}i £ x t of size n* = \X t \ of all data, 
where It C {1,..., IV} is a (possibly random) index 
set, with sizes 0 < n\ < < . .. Ul = N. The partial 

posterior corresponding to It is then n t = p{9\V t ) oc 
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p(T> t \9)p{9) 2 . Paths of partial posterior measures can 
be constructed by starting from the prior tto( 9) = p(9) 
and increasing the size of the batches T> t until exhaust¬ 
ing the whole set of observations, i.e. 

7T 0 ((9) -» 7Tl(0) -> tt 2 (9) ->-t tt n (9), 

where ttn{ 9) = p(8\x±,... ,Xn) is the full posterior. 

For simplicity of exposition, we here consider the case 
of a geometric increase in batch sizes. More precisely, 
we set m = a,... , n t = 2 t ~ 1 a, ...,ni = 2 i-1 a = JV, 
where L = log 2 {N/a) + 1 is the number of possible 
batch sizes, a is the smallest batch size considered. 
We assume that log 2 (lV/a) is an integer. Note that 
common ratios other than 2 are possible, and are used 
in the experiments. 

The next section presents the debiasing device, which 
is a key component in transforming estimates of ex¬ 
pectations over multiple partial posteriors ¥, nt {ip(6)} 
into an unbiased estimator of the expectation over the 
full posterior E WAf {<^(6 ) )}. 


3.1. Debiasing Lemma algorithm 

The debiasing Lemma provides a way to construct 
an unbiased estimator for the limit of a converg¬ 
ing a sequence - without evaluating all elements. 
Here, we transform a sequence of asymptotically cor¬ 
rect, but biased estimators into an unbiased esti¬ 
mator. In different contexts, the result was origi¬ 
nally discussed independently by McLeish (2011) and 
Rhee and Glynn (2012). It was shown in the present 
form in Rhee and Glynn (2013, Theorem 1); see also 
Jacob and Thiery (2013, Theorem 1.1). 

Lemma 1 (Telescoping Estimator). Let <fi and 
{4>t}^Li be real-valued random variables 3 , and let T 
be an integer-valued random variable independent of (f> 
and of i E [T > t] > 0 Vt € N. With the 
convention </>o = 0, assume that 


~ E{|^*_! -0| 2 } 

4^ V[T>t\ 


( 2 ) 


Then, 


&T ~ E 

t= 1 


4>t - 4>t-i 

P [T > t] 


is an unbiased estimator o/E{<^} ; i.e. E{^} = E{</>}. 

2 Note that while L is the number of subsets, they are 
indexed with the subscript t = 1... ,T for some variable 
T < L that will be introduced later. 

3 Agapiou et al. (2014) very recently developed a Hilbert 
space version of the Lemma. 


Moreover, 


OO 

E {(^) 2 } = E 

t =1 


E — </>| 2 j — E ||</> t — <(>| 2 j 
P [T > t] 


Finite variance for unbiased Bayesian expec¬ 
tations The above Lemma 1 is directly applicable 
in our context. We set <f> t = E^i^ (0)}, for t < L 
where the empirical expectation E W( is computed by, 
e.g. MCMC 4 on the partial posterior n t , and (f> t = <j> = 
E 7rjv {tp(9)} for t > L. In the finite data regime, the 
conditions of the above Lemmas are trivially satisfied 
since we set (f> t = <t> almost surely for t > L. The vari¬ 
ance of estimators is thus finite by construction and 
the truncation variable T needs only to be supported 
on 1,..., L, and all infinite sums can be replaced with 
sums up to L. However, variance might still increase 
increase with N without a bound. Therefore, in order 
to ensure stability of the estimators in the Big Data 
regime, we here require an analogous condition to (2) 
that will guarantee that the variance remains constant, 
i.e. that as N —> oo, 


E 


^E 

V 


V:W8)}- ^7TiV wm 



P [T > t\ 




= 0 ( 1 ). 


(3) 

Intuitively, we require that the tail of the stochastic 
truncation variable matches the rate of convergence of 
partial posterior expectations. See the next Section 
and Appendix A for details on a simple setup where 
this condition holds, along with a way to tune the trun¬ 
cation distribution P[T > t\. 


Note also that in the same fashion as in 
Rhee and Glynn (2013), one can replicate the 
random truncation procedure R times and thus 
reduce variance. More precisely, ^E=i 0T r an 
unbiased estimator of E VN {ip(9)} and its variance 
scales as 1/R. Here, {T r }^ =1 are independent copies 
of T and each is computed on a different partial 
posterior path. This implies that the scheme can be 
repeated until a desired error tolerance is attained. 
The latter can be estimated from the empirical 
variance of the . 

Algorithm 1 summarises our approach, and Figure 1 
illustrates both construction of the (f)f from partial 
posterior paths, and their distribution. 


In summary, the key properties of the described 

4 We realise that empirical expectations computed with 
MCMC are technically biased and will comment on this in 
the next Section. 
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Algorithm 1 Debiasing posterior expectations 
Input: Discrete distribution A over 1 corre¬ 

sponding to batch indices 
For r = 1,..., R (number of replications), or 
while not achieved desired error tolerance 

• Sample truncation variable T r ~ A 

• For t = 1, ... ,T r 

— Compute fa = Ett, {<fi (0)} (expectation over 
the partial posterior on batch T> t of size rit), 
e.g. by running MCMC on 7r t 

• Compute the debiased estimate </>* = 

Z-,t=l P[T r >t] 

Return: the average of debiased estimates <p* = 
5 £?=!#■ 


methodology is that full posterior expectations over 
7 tat can be estimated, with no bias introduced and with 
a bounded increase in variance. This is achieved by us¬ 
ing sub-samples of the available data - at a sub-linear 
average computational cost as we will see next. 

3.2. Practical considerations 

We now list several properties, implications, and key 
advantages of our scheme. 


Computational costs and variance Let us denote 
by r the time required to generate a single debiasing 
estimator <j)j,. Since computing (f>j, requires running 
T MCMC chains, on batch sizes ni,... ,ut, t would 
scale linearly with the overall number of likelihood 
evaluations, resulting in the average time complexity of 
E{t} = 0(¥jT{ni+...+riT})- If the batch-size increase 
is geometric, i.e. n\ = a,...,nt = 2 t ~ 1 a,... ,ul = 
2 L ~ 1 a = N, the cost becomes 0 (aE-r {2 T }j. By 
matching this with truncation probabilities A t = 
P [T = t] cx 2 ~ at , for 0 < a < 1, we obtain an aver¬ 
age cost of O (a(A/a) 1_a ), which is sub-linear in N , 
see also Appendix A. This cost reflects the amount of 
computation when only a single core is available, and 
the trivial parallelisation of the scheme allows further 
savings, as described below. 

The variance of the debiasing estimator depends on 
the rate of convergence of the partial posterior ex¬ 
pectations. In order to ensure that te variance stays 
bounded as N increases, assume that there exist a con¬ 
stant c and (3 > 0, such that for large enough N and 


4 • 


/ 



★ Prior mean 

_ i Debiasing estimate ‘P* 

True Posterior mean 
. Debiasing estimates <p* 

-2 

0 2 4 6 

U2 

Figure 1. Illustration of Algorithm 1 for the posterior mean 
of a 2D Gaussian with unknown mean fi and fixed covari¬ 
ance S. Data is V = {xi ~ A/”(xi|/r = 2, S)}D° with 
E = [(—1, 3) t , (3,1) T ], prior is p(/r) = A/’(/r|0, 1). We aim 
to compute the posterior mean f fj,p(^i\T>)dfi. Debiasing 
computes multiple posterior paths (coloured solid lines), 
which are randomly truncated (solid line stops), and then 
plugged into the debiasing estimator in (1) to estimate the 
posterior mean of fii and (coloured round dots, dotted 
lines connect path end-point to estimate). The procedure 
is averaged R = 1000 times (gray dots), after which the 
empirical mean matches the full posterior mean. A kernel 
density estimate of the gray dots is shown in the back¬ 
ground. 
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Vf < L: 


E 


KW)} 





From here, as shown in Appendix A, (3) is satisfied and 
therefore variance remains bounded as long as a < (3. 
Thus, fast convergence of partial posterior expecta¬ 
tions, e.g., /3 close to 1, can result in significant speed- 
ups of the scheme. We give examples of empirical fits 
for /3 in Appendix A. 


Tuning truncation probabilities We now de¬ 
scribe how to tune free parameters of the scheme. 
Following Glynn and Whitt (1992), if both the aver¬ 
age time complexity E{r} and the variance Var{</f>^} 
are finite, a central limit theorem holds in the limit 
where computational budget k —> oo. Namely, for a 
given computational budget «, if we denote by R K the 
number of debiasing replications that can be generated 
in k time, R K = max |f? > 0 : T r < k j and by 

(j)* = A- </>* the resulting average of debiasing 

replications, then 

(f {K) - E{^(0)}) - J\T (0, E{r}Var {#}). (4) 

Thus, the asymptotic efficiency of the debiasing esti¬ 
mator is characterised precisely by the work-variance 
product. Figure 2 demonstrates how the distribution 
of the stochastic truncation variable (here in the para¬ 
metric form P [T = t] oc 2 ~ at ) can be optimised in or¬ 
der to yield minimal asymptotic variance of the esti¬ 
mators, see also Appendix A. 


MCMC and finite time bias Any empirical ex¬ 
pectation computed from finite MCMC algorithms is 
only correct in the asymptotic limit. Consequently, 
setting f° r sampled from a fi¬ 

nite time Markov chain is problematic, as unbiased¬ 
ness of the overall approach technically corrupted. 
However, the same is true for any MCMC estimate. 
In practice, careful tuning of simulation parameters 
such as burn-in, thinning, and running multiple chains 
(Gelman and Rubin, 1992), is successfully used to re¬ 
duce finite time biases to a neglectable level. We adopt 
this mindset here for the sake of practicality and sim¬ 
plicity of presentation. A way to address the issue 
could be to apply debiasing to the Monte Carlo esti¬ 
mate itself. In Lemma 1, set (f>t,e = 
for 6i t g drawn from an approximate (not converged) 
Markov chain after i iterations, as opposed to be 
drawn from the asymptotically invariant distribution. 
This gives a sequence Via applying the debi¬ 

asing Lemma, an unbiased estimator E,r t {ip (0)} can be 



Figure 2. Complexity-Variance tradeoff as a function in a 
for N = 10000. Optimising these quantities for a minimum 
batch size of a = 128 gives a best a = 0.87. The zoomed-in 
area shows a zoom-in around the minimum (with a different 
scaling on both axes) and reveals that the minimum is well 
defined. 


constructed for any partial posterior 7r t ’s expectation. 
In a second stage, these debiased partial posterior ex¬ 
pectation estimates can be used to estimate the full 
posterior expectation - now fully unbiased. Unfortu¬ 
nately, as i is an infinite sequence, the variance 

expression in Lemma 1 is not trivially guaranteed to be 
finite anymore. See Glynn and Rhee (2014); McLeish 
(2011); Agapiou et al. (2014) for an explicit and in- 
depth treatment of unbiased Monte Carlo estimation. 

MCMC and mixing time If a Markov chain 
is, in line with above considerations, used for com¬ 
puting partial posterior expectations E Tt {</)(0)}, it 
need not be induced by any form of approxima¬ 
tion, noise injection, or state-space augmentation of 
the transition kernel. As a result, the notorious 
difficulties of ensuring acceptable mixing and prob¬ 
lems of stickiness are conveniently side-stepped - 
which is in sharp contrast to all existing approaches. 
Furthermore, while the latter are limited to par¬ 
ticular MCMC schemes (Bardenet et ah, 2014, ran¬ 
dom walk), and (Welling and Teh, 2011, Langevin), 
any MCMC procedure can be employed in our con¬ 
struction. This allows us to harvest decades of 
mathematical and engineering effort that went into 
both methodology and software packages (e.g. Stan 
(Stan Development Team, 2014). Mixing time when 
using MCMC to estimate partial posterior’s expecta¬ 
tions is not compromised by our approach, in contrast 
to for example Firefly MCMC, whose mixing time gets 
worse as the mini-batch size decreases. MCMC chains 
over partial posteriors do not suffer from such prob¬ 
lems. Indeed they are in practice often easier to han- 
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die due to their similarity to the (usually simply struc¬ 
tured) prior distribution. 

Parallelisation As computation required for each 
partial posterior 7 q in a single path can be performed 
independently, the method embarrassingly parallelises 
and expectations computed in parallel only need to be 
combined in a telescoping sum. The same holds true 
for replications of the scheme: since the computational 
cost within each truncated posterior path is dictated 
by the largest batch size that needs to be processed in 
parallel, the required wall-time in the case of geomet¬ 
ric batch-size increases is roughly halved. Therefore, 
with sufficient computational resources, the potential 
speed-up factor through parallelisation is 2 R, where R 
in practice is usually in the 100s to 1000s as we will 
see in the experiments. 

4. Experiments 

In this section, we illustrate the utility of the debiasing 
approach and compare it against other unbiased ap¬ 
proaches: full posterior sampling and Firefly MCMC. 
In particular, we show that for large-scale datasets, de¬ 
biasing can accurately and confidently estimate poste¬ 
rior expectations before full MCMC and Firefly have 
produced a single estimate. 

It is clear that running an MCMC chain on the full 
posterior, for any statistic, produces more accurate es¬ 
timates than the debiasing approach, which by con¬ 
struction has an additional intrinsic source of variance. 
This means that if it is possible to produce even only 
a single MCMC sample (after burn-in), the resulting 
posterior expectation can be estimated with less ex¬ 
pected error. It is therefore not instructive to compare 
approaches in that region. 

On comparing to Firefly MCMC For a fair com¬ 
parison of our method to Firefly MCMC, we give an 
estimate for the number of likelihood evaluation nec¬ 
essary for Firefly MCMC to produce the first sam¬ 
ple - for which there are two notable obstacles. The 
first is computing a maximum a posteriori (MAP) es¬ 
timate to initialise a lower bound on the likelihood: 
Maclaurin and Adams (2014) reported Firefly’s per¬ 
formance to be inferior to standard MCMC otherwise. 
For large datasets, MAP estimates are challenging as a 
standard gradient based optimisation scheme such as 
BFGS needs multiple evaluations of the full likelihood. 
For example, Stan’s BFGS implementation on a com¬ 
monly used benchmark dataset, a9a (Welling and Teh, 
2011; Lin et ah, 2008), takes around 40 iterations to 
reach a reasonable convergence level. While this is¬ 


sue can be somewhat sidestepped via using stochastic 
gradient descent. However, given a MAP estimate, a 
one-off cost of 0(ND ), i.e. computing sufficient statis¬ 
tics of all data (Maclaurin and Adams, 2014), cannot 
be avoided. This is challenging for extremely large 
datasets. Furthermore, Firefly is based on binary in¬ 
dicator variables that determine whether a point in 
a factorised likelihood is used for an MCMC update 
(bright) or not (dark). The second obstacle comes 
from Firefly’s parameter that is the probability of 
brightening a dark point. First, at least gd->-tW points 
need to be evaluated in each iterations, which is linear 
in N. Second, mixing time suffers at least by a factor 
of l/gd-)-b, which means that burn-in time and number 
of MCMC iterations need to be multiplied by that fac¬ 
tor to compare to full MCMC in a fair way. Together, 
this means that Firefly MCMC roughly needs the same 
number of likelihood evaluations as full MCMC before 
it produces the first sample - implying that our com¬ 
parisons to the full MCMC directly apply to Firefly 
MCMC as well. 

We now provide a number of examples, where we anal¬ 
yse convergence of the debiasing estimator up to the 
number of likelihood evaluations necessary to produce 
a single sample. All estimates are given as function of 
the number of likelihood evaluations needed to com¬ 
pute them (including burn-in). Note that, in favour of 
competing methods, we do not take parallelisation into 
account, which (given appropriate hardware) would in¬ 
crease the effective number of likelihood evaluations 
per unit time by a factor of 2 R. 

4.1. Synthetic log-Gaussian 

We first consider a toy model from Bardenet et al. 
(2014), but with more data 5 : rather than the orig¬ 
inal 10 5 , we generate 2 26 ft: 10 8 data from a log- 
Gaussian logA/"(/x, er 2 ), where p = 0 and a 2 = 2. 
Using flat priors, we sample from the joint poste¬ 
rior over (/x, a) and aim to estimate the marginal 
posterior mean of a. This posterior has extremely 
wide tails, which causes problems for MCMC meth¬ 
ods based on approximate transition kernels. In par¬ 
ticular, Bardenet et al. illustrate that even when 
using an appropriate setting of the mini-batch size, 
Korattikara et al.’s scheme (confidently) gives com¬ 
pletely wrong results. Bardenet et al.’s sampler is able 
to recover the model’s standard deviation but this 
comes at the cost of using almost all available data 
in every MCMC iteration. 

Attempting to resist the commonly followed tempta¬ 
tion of applying large-scale methodology to only medium 
sized datasets. 
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This happens despite the fact that such simple pos¬ 
terior expectations converge rapidly. Figure 3 shows 
convergence of the partial posterior mean of cr as a 
function of mini-batch size r] t . Even though all such 
estimates are biased, the plot reveals that using mul¬ 
tiple MCMC chains on subsets of constant size and 
averaging gives a small estimation error quickly. This 
raises the question whether manipulating the Markov 
transition kernel is the best way of addressing such 
problems. 

Debiasing, in contrast, is a way of exploiting rapid 
convergence of posterior expectations, while remaining 
unbiased - as demonstrated in the upper part of Figure 
4, where we show that we can recover the true model 
parameter confidently and quickly. We run a number 
of debiasing replications with a minimum batch size of 
a = 8, and geometric truncation probability a close to 
1. Each of the partial posterior expectations are com¬ 
puted via MCMC 6 * with 500 iterations after a burn-in 
of 100 iterations. We count the number of likelihood 
evaluations for each partial posterior, taking into ac¬ 
count the 600 MCMC iterations. We run full MCMC 
with the same number of iterations and burn-in (note 
that this is in favour of full MCMC as partial posterior 
distributions should be explored in fewer iterations). 
We count N = 2 26 likelihood evaluations per MCMC 
iteration, with an offset of 1001V burn-in iterations. 

Remarkably, as Figure 4 indicates, the largest partial 
posterior size was only max r {nT r } = 2048, leading to 
a maximum single replication cost of Ht = 

4088 as depicted in Figure 4, and a median nr r of 
only 16. After R = 300 replications, the number 
of data touched in total is i n t = 27, 264. 

Taking into account the 600 MCMC iterations to es¬ 
timate each partial posterior expectation, this sums 
up to 16,358,400 likelihood evaluations, which is less 
than a quarter of a single full MCMC burn-in iteration 
(N = 2 26 ), and less than 1/(4-500) ss 0.0005 times the 
number of likelihood evaluations required to complete 
the burn-in of full MCMC. 

Fast convergence of posterior expectations as in Fig¬ 
ure 3 is not an artefact of the above model. As we 
demonstrate in the next experiment, such situations 
arise in more involved inference tasks as well. 



Figure 3. Convergence of partial posteriors’ mean of cr as 
a function of sub-sample size for Bardenet et al.’s log- 
Gaussian model. We randomly sub-sample r/t data from 
the dataset and run MCMC to estimate a, whose true value 
is cr = \/2. Error bars are 95% over 150 trials. 
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4.2. Large-scale logistic regression 

We now apply our methodology to a large-scale 
Bayesian logistic regression problem on N = 10 8 data. 

6 We use the NUTS sampler (Stan Development Team, 

2014), assuming a constant number of leap-frog iterations 

in HMC. 


Figure 4. Convergence and used data for debiased mean 
of posterior mean of cr for Bardenet et al.’s Log-Gaussian 
model. Top: Debiasing estimates converge to true pos¬ 
terior statistic cr = y/2 quickly. Bottom: Small per- 
replication and cumulative data usage. 
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As full posterior simulation is infeasible for models of 
such size, we choose a synthetic dataset in order to 
quantify estimation error. 

We model binary labels of N = 10 8 observations of 
D = 9 features xt £ S> D as p(y,(3) = <j(yi/3 T Xi)p(/3), 
where a is the logit function and p(/3) are independent 
Laplace priors with a scale of 1. The bias parameter is 
absorbed into /3io and x±q = 1. To generate data, we 
sample covariates Xi ~ Af(xi\0, D~ 2 ) and label them 
positively with probabilities a(xjf3). True regression 
weights are set to = 1 for i = 1,..., D. 

As in the previous example, simple posterior statistics 
such as mean regression weights, i.e. <p(/3) = /?* for 
i = 1,..., 10, converge quickly: Figure 5 (top) reveals 
that the statistics do not significantly change when 
computed from randomly sub-sampled mini-batches 
larger than 10000, which is 4 order of magnitudes 
smaller than N = 10 s . Note that it not possible to 
run even a single MCMC chain on the full dataset - 
in contrast to our debiasing approach. 

We apply our debiasing estimator, using a minimum 
batch size of only a = 100, a geometric batch size 
increase of 2, and run Stan’s NUTS sampler for 500 
iterations after a burn-in of 100 iterations. Figure 5 
(bottom) shows examples of the convergence of the de¬ 
biasing estimator over R = 1000 replications. Taking 
into account the 500+100 inherent MCMC iterations, 
the median data usage per replication is 256-600 likeli¬ 
hood evaluations data points, the average is 722 ■ 600. 
Summing over all 1000 replications, debiasing takes 
0.021V • 600 ~ 9 N likelihood evaluations. This means 
that a full MCMC chain would not even have passed 
the 100 iterations of burn-in, while debiasing already 
converged close to the ground truth posterior statis¬ 
tic. We stress that this comparison is extremely con¬ 
servative: given appropriate computational resources, 
parallelisation allows for a speed-up of up to factor 
2 R = 2000. This means that we can reduce error bars 
by an additional large factor without increasing com¬ 
putation wall-time. 

5. Extensions 

We now describe two extensions of our framework 
that illustrate its generality compared to other sub¬ 
sampling based approaches, and give experimental il¬ 
lustration. This includes an experiment where we are 
able to outperform stochastic variational inference for 
Gaussian Processes on a large-scale real-world dataset. 
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Figure 5. Logistic regression on N = 10 s synthetic data. 
Top: Posterior mean convergence as a function of mini¬ 
batch size, for all 9 regression weights and the bias term 
(outlier in the plot). They converge from about 10000 data 
(averaged over 100 trials). Note how the bias term com¬ 
pensates for small weights in the low data regions of the 
plot. Bottom: Debiasing convergence as a function of the 
number of replications R , for the last 3 regression weights 
and bias term. Other regression weights behave similarly. 
95% error bars, dashed line indicates ground truth from 
the above plot. All R = 1000 replications correspond to 
9 N likelihood evaluations, not taking parallelisation into 
account. 
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5.1. Likelihoods need not factorise 

The debiasing device for constructing the unbiased 
estimators of posterior expectations does not require 
that the likelihood factorises, i.e. that 

N 

p(x i, ...,x N \0)= Y[p{xi\0). 

2=1 

We simply require access to partial likelihoods 
p{xi ,..., x ni 1 6) for a given batch size ni. To the best 
of our knowledge, this is in sharp contrast to all other 
methods available for MCMC in the Big Data regime, 
where the likelihood has to be computable and typi¬ 
cally is assumed to factorise. As such, debiasing over 
partial posterior paths can also be applied to cases 
where posterior distributions are available in closed 
form but only at a prohibitive amount of computa¬ 
tional cost. 

Approximate Gaussian Process regression is a 

typical example for a non-factorising likelihood. We 
focus on a simple case of predictive posterior in Gaus¬ 
sian Process (GP) regression, 

*n( y*\x*,y,X ) := p(y*\x*,y, X) 

= Af( kJiK + Xiy'y, 

k(x*, x*) — kj (K + A/) _1 fc* (5) 

where K is the covariance function evaluated 
at pairwise training covariates X, and fc* = 
(k(xi, a;*),..., fc(xjv, cc*)) T , and observation noise 
variance A (Rasmussen and Williams, 2006, Section 
2). This requires the inversion of an N x N covari¬ 
ance matrix and therefore costs 0(N 3 ) computation. 
Note that predictive mean and variance here can be 
computed exactly - no MCMC simulation is required. 
With the debiasing approach, it suffices to look at 
the expectations of partial predictive posteriors nj(y *), 
which is again based on sub-sampling all available data 
X and y. As each evaluation then requires 0(n 3 ) com¬ 
putation, the average computational costs are given by 
Et (X]J=i = & (-/V 3_ “), where we set rij = 2■ 7_1 a, 
and p t oc 2~ at as before. 

The above, however, can still be infeasible in practice, 
and further savings can be obtained by applying the 
debiasing onto the primal form of (5), in combination 
with an explicit finite rank representation of the kernel 
function k{x,x') = ^T0 X , with c/> x £ R m , e.g. inducing 
variables (Quinonero Candela and Rasmussen, 2005), 
random Fourier features (Rahimi and Recht, 2007), or 
Incomplete Cholesky (Fine and Scheinberg, 2001). By 


performing Bayesian linear regression on this explicit 
(approximate) feature space, the posterior for a single 
test feature 0 * becomes 

KN(y*\<j>*,®,y) = AT( 4 % ( < F t $ + XI) 1 $ T y, 

o'- (<]>' <I> — XI) 1 o*). (6) 

with feature matrix d> = [(/)J ,..., Evaluation 

now requires a reduced cost of 0(m 2 N). Having a 
cost linear in N for obtaining each expectation gives a 
debiasing average computational cost of O (to 2 7V 1_ “) , 
which again is sub-linear in N for a fixed feature space 
dimension to. As before, each of the mini-batches can 
be processed in parallel. 

Experimental illustration To illustrate the above 
idea, we generate N = 10 4 toy data for a univari¬ 
ate non-linear regression problem in an approximate 
feature space given by Rahimi and Recht ’s random 
Fourier features. Note that this exactly corresponds 
to a Bayesian linear regression with the mapped fea¬ 
tures. More specifically, we choose a Gaussian kernel 
with unit length scale, k(x,x') = exp (— \\x — x '\\ 2 ), 
whose associated random feature space of dimension 
to = 100 is given by the mapping 

Vm.(f) x = (cos(u>iX + 61 ),..., cos (w m x + b m )) T , 

where Wi ~ Af(0, 1) and bi ~ Unif orm(0, 27r) are fixed 
and the covariates are randomly spread in [0,10]. We 
sample a set of training labels from the correspond¬ 
ing approximate GP prior, add observation noise, and 
resample the feature space basis via Wi,bi. We then 
fit the data and compute the predictive mean from 
equation ( 6 ) for a set of 1000 randomly chosen test co¬ 
variates X * with test features <h*. The ground truth y* 
is chosen to be the predictive mean using all N = 10 4 
data . 7 

As before, we begin by exploring convergence of the 
desired posterior statistic, here averaged for multiple 
test features <£*. For a given partial posterior size, we 
repeatedly sub-sample observations and compute the 
predictive mean. Figure 6 (top) shows convergence of 
the mean squared error (MSE) of the predictive means 
for all test features as a function of partial posterior 
size. The MSE only gets close to zero when almost all 
data is used. This is unlike in previous examples and 
therefore shows that the functional corresponding to 
GP regression, i.e. equation (5), is more complicated. 

We apply the debiasing scheme and compute the aver¬ 
age computational complexity, i.e. the average size of 

7 Note that this is different to the predictive mean using 
an exact GP, but suffices for illustration purposes here, as 
the MSE is zero by construction when all data is used . 
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all partial posteriors of a single debiasing replication. 
Given that complexity, Figure 6 (top) shows the MSE 
if we were to average predictions over multiple mini¬ 
batches. In contrast, Figure 6 (bottom) reveals that 
debiasing achieves a much better MSE at the same 
average computational cost. 

To our knowledge, none of the other approximate or 
exact sub-sampling-based MCMC schemes can be ap¬ 
plied to this example. We therefore resort to compar¬ 
ing against a popular approximate inference method 
for such GP models. 

5.2. Comparing to stochastic variational 
inference on real-world data 

Another way to approach Gaussian Processes in the 
Big Data context is via stochastic variational inference 
(GP-SVI). Hensman et al. (2013) combine a decade’s 
work on sparse GPs, variational bounds, and stochas¬ 
tic gradient descent to fit huge GP models in a stream¬ 
ing fashion. The clever usage of a number of approx¬ 
imations allows them to cut the computational costs 
from 0(N 3 ) down to 0(m 3 ), where m is the num¬ 
ber of inducing variables and constants depending on 
number of iterations and mini-batch size. 


Airtime delays We apply a combination of random 
Fourier features and debiasing (as in previous Section) 
to the real-world problem of predicting arrival time 
delays in flight records (Hensman et al., 2013, Section 
4.3). This involves N = 700,000 data consisting of 
8 -dinrensional covariates and real labels. We aim to 
estimate the predictive mean of a GP for 100, 000 ran¬ 
domly chosen test covariates. We use the exponenti¬ 
ated quadratic covariance function, with a finite-rank 
expansion via random Fourier features. For the sake 
of simplicity, we do not apply a different length-scale 
to each dimension and include no bias term. Instead, 
we centre the data and re-scale to unit variance in a 
preprocessing step 8 . We match the number of random 
Fourier features m = 1000 to the number of inducing 
points in the GP-SVI experiment. 

In debiasing, We use the minimum batch size a = 500, 
and set the trucation distribution to match an average 
computational cost of roughly 2773 for each of the R = 
100 replications, which is an order of magnitude less 
than the batch size of 1000 for 1000 iterations in the 

8 Indeed, we were not able to obtain significant differ¬ 
ences working with varying length-scales or other hyper¬ 
parameters. Furthermore, Hensman et al. (2013) do not 
report predictive variance, for which tuning such parame¬ 
ters is more essential. However, random Fourier features 
are easily adapted to such covariance functions. 


GP Regression, predictive mean 



Average cost: 469 



Figure 6. Top: Convergence of MSE on test features as a 
function of data used for training. We randomly subsample 
available data, train the GP and compute the test MSE. 
95% error bars obtained by averaging over 200 trials. Note 
that the MSE eventually vanishes as we compare against 
the predictive mean obtained via using all data. Bottom: 
MSE as a function of debiasing replications. We compute 
multiple debiasing estimates for a given average cost and 
plot their running mean. The given average cost corre¬ 
sponds to a vertical slice in the above plot, i.e. predictions 
of constant sized mini-batches of size 469 gives an MSE of 
more than 0.2 while debiasing with the same average cost 
reaches almost zero error. 
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GP-SVI experiment. 

Remarkably, as shown in Figure 7, debiasing out¬ 
performs Hensman et al. (2013, Figure 7). GP-SVI 
achieves a square rooted mean squared error (RMSE) 
of 32.6, while debiasing achieves less than 27.9. Care 
has to be taken when concluding from these RMSE 
comparisons - both methods are likely to be improved 
by tuning, the full experimental protocols are not 
available, there are slight differences in finite-rank ap¬ 
proximations, etc. Instead, we make the point that 
debiasing achieves a competitive performance. How¬ 
ever, while GP-SVI is highly engineered to these very 
same GP regression models, debiasing is a more gen¬ 
eral method for estimation in Bayesian inference - with 
GP regression only being one of its applications. 

While this example is promising, we leave a thorough 
comparison with streaming variational Bayes for fu¬ 
ture work. 

5.3. Reducing bias in streaming applications 

The debiasing formalism is easily applicable in a sce¬ 
nario where the amount of data is unknown or unlim¬ 
ited, e.g. in a streaming scenario. Previously, we dis¬ 
cussed targeting the posterior given a fixed number N 
of observations {xi}^L 1 , an d constructing the stochas¬ 
tic truncation variable T such that a small probability 
remains that all observations are used for computing 
the desired expectation. In contrast, in the stream¬ 
ing scenario, we are unable to process all observations 
at a time, and the nature of the problem forces us to 
process observations in batches - which are discarded 
afterwards. Debiasing is still possible: we fix a worst 
case budget 7V max , which is the largest number of ob¬ 
servations that can be processed at a time (e.g. guided 
by the hardware restrictions). 7V max then replaces N 
in the static case: the stochastic truncation variable 
T allows processing JV max observations at maximum. 
This means that the bias with respect to the full pos¬ 
terior (of unknown size) still remains. However, as 
R <C iV max , it is typically of the order O (l/y/N max ), 
and is therefore subsumed by the error bars over R 
replications, which are of the order 0(l/y/R). 

Note that in the streaming scenario, no fully unbiased 
scheme is available. 

Toy example We compare the debiasing scheme 
with the constant-batch scheme on a simple posterior 
mean estimation in a Gaussian model with known vari¬ 
ance, where x-i ~ A f(6, 5000 2 ) with prior 9 ~ Af(0, 50 2 ) 
and true 9 = 10. Results are given in Figure 8. They 
show that the debiasing estimator is less biased and 
has more appropriate error bars where the constant 


Airtime delays 



Average cost: 2773 



Figure 7. Debiasing for the airtime delays dataset. Top: 
Convergence of RMSE as a function of mini-batch size 
using random Fourier features. The results using 1000 
data are slightly better to those reported in Hensman et al. 
(2013), speaking for our choice of hyper-parameters. Bot¬ 
tom: With an order of magnitude less average computa¬ 
tional cost than GP-SVI (see text), we are able to repro¬ 
duce a comparable RMSE on 10 5 randomly chosen test 
covariates. In particular, as in the previous GP example, 
Figure 6, we are able to obtain a better RMSE than averag¬ 
ing predictions obtained from constant batch sizes with the 
same computational costs. 95% error bars are computed 
over 20 repetitions. 
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Figure 8. Reduced bias for mean estimation in a Gaussian 
model with known variance. Our estimator is less biased 
and has more appropriate error bars than averaging over 
constant batch sizes, which is overconfidently biased. Both 
approaches have the same average computational costs. 
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batch-size approach is overconfident and strongly bi¬ 
ased. The constant batch size was chosen to make the 
computational cost of the two schemes comparable. 
Results show estimates towards the end of the total 
of 50,000 replications after each scheme has streamed 
around 10 9 datapoints. 

6. Discussion 

In this section, we present shortcomings and prob¬ 
lems, both conceptual experimental, that have to be 
addressed in future work. We close by summarising 
our contributions. 

Bias from memory restrictions In order for the 
estimator in (1) to be unbiased, one needs to assign a 
non-zero probability for each of the possible values of 
the truncation variable T - and the resulting partial 
posterior expectations need to be computable in finite 
time. In all presented examples, sampling large val¬ 
ues of T only results in a long runtime. However, such 
large T might also result in a partial posterior statistics 
that are impossible to estimate due to restrictions of 
available computing resources. A common example are 
memory limits arising from large Gaussian covariance 
matrices, see for example (Lyne et al., 2013). We side¬ 
stepped such problems in our experiments by using a 
finite-rank kernel expansion. However, in general our 
estimator is not unbiased in cases where partial pos¬ 
teriors exceed available machine memory. In practice, 
allowing a fixed computational budget and tweaking 
the truncation distribution such that larger values are 
almost never sampled, yields good results. Developing 


a more sophisticated solution is left for future work. 


Convergence on short posterior paths In exper¬ 
iments on smaller datasets, we could not beat full pos¬ 
terior sampling in terms of estimation error per compu¬ 
tation time. Only when the sub-linear average compu¬ 
tational complexity 0(N 1 ~ a ) is significantly less than 
0(N), debiasing outperforms MCMC. It will be in¬ 
teresting to study the connection of data size N and 
truncation parameter a for different classes of poste¬ 
rior functionals ip. 

Figure 9 shows results for (sparse) logistic regression 
on the a9a dataset (Lin et al., 2008; Welling and Teh, 
2011), which consists of IV = 32561 covariates of di¬ 
mension 123. Using the same model as in Section 4.2, 
we aim to estimate the posterior mean of the first re¬ 
gression weight /3i. Note that full posterior sampling 
on this dataset takes days. Figure 9 (top) shows con¬ 
vergence of partial posterior statistics, which tend to 
stabilise from about 1000 data. Convergence of debi¬ 
asing, Figure 9 (middle), behaves well at first sight. 
However, as N in this case is relatively small, the 
probability to sample a partial posterior path trunca¬ 
tion that includes the whole dataset is relatively high. 
In the presented debiasing run, this happens around 
replication 100. As this results in full posterior sam¬ 
pling, debiasing is pointless. Unfortunately, the con¬ 
vergence at this point has not yet reached an accept¬ 
able level. 

Summary 

We presented an alternative perspective on large-scale 
Bayesian inference problems, and developed a novel 
framework for approaching those in practice. For cases 
where the goal is estimation of Bayesian posterior ex¬ 
pectations, rather than simulation from the posterior, 
we side-stepped the many serious convergence prob¬ 
lems arising from employing approximate transition 
kernels of Markov chains for simulation. By exploit¬ 
ing the debiasing Lemma, we were able to estimate 
these posterior statistics efficiently from partial poste¬ 
rior statistics. Data complexity is sub-linear in N, no 
bias is introduced, variance is finite. 

Implementing our approach is trivial as it exploits ex¬ 
isting work on MCMC and easily fits in with other 
inference schemes. Free parameters are easy to tune. 
It furthermore is embarrassingly parallelisable. We 
conducted experiments to illustrate cases where de¬ 
biasing can accurately and confidently estimate pos¬ 
terior statistics before competing simulation methods 
are able to produce a single estimate. The presented 
methodology is not limited to factorising likelihoods 
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Logistic Regression a9a, posterior mean of f3i 




Figure 9. Logistic regression on the a9a dataset consist¬ 
ing of N = 32561 data of dimension D = 123. We esti¬ 
mate posterior mean of the first regression weight /3i for 
increasing data size. Top: Convergence of partial pos¬ 
terior statistics. Bottom: Convergence of the debiasing 
estimator looks promising. However, at around iteration 
100, full posterior sampling is performed. 


or MCMC as an internal inference scheme. We carried 
out experimental examples that showcased competi¬ 
tiveness of debiasing compared to full posterior sam¬ 
pling and stochastic variational inference. 

Most essential areas for future work are (i) explor¬ 
ing the computation-variance tradeoff in detail, also 
in context of other than geometric truncation distri¬ 
butions (ii) dealing with finite time bias when MCMC 
is used, (iii) a thorough formal and experimental com¬ 
parison with other large-scale inference schemes such 
as stochastic variational inference. 
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A. Computational complexity and variance for geometric batch schedule 

In this section, we show that for the simple choice of a geometrically increasing batch schedule, and a geometric 
stochastic truncation variable, it is possible to obtain sub-linear expected complexity of the debiasing scheme. 
Furthermore, variance remains bounded by a constant as N increases. 


Number of likelihood evaluations For simplicity, we assume the common ratio 2, i.e. the batch sizes are 
ni = a,...,riL = 2 L ~ 1 a = N, where L = log 2 {N/a) + 1, and a is the smallest batch size considered, and 
log 2 (7V/a) is an integer. We express computational costs in terms of the number of likelihood evaluations £. It 
is easy to see that £ is a function of the stochastic truncation variable T, i.e. 

T 

£(T) = M^n t = Ma (2 T — l), 
t= i 


where M is the length of the MCMC chains (assumed to be constant throughout). Namely, T chains need to be 
run on partial posteriors w.r.t. n\,... ,riT datapoints respectively. 

We write the truncation probability as pt := P(T = t) oc 2~ at for some a £ (0,1). The normalizing constant of 
the corresponding density is given by Z a = J2t =i 2 _at = 2 _Q (l + 2~ a + ... 2 _ “(- L_1 )) = 2~ a ~ i- 2 ~ a > 

leading to the expected number of likelihood evaluations being 


Ma 


E [£(£)] = —£(2*-1)2 


Ma 1 _ a 2^- a ) L -l 
Z a 2 1 ~ a - 1 

« 2 Ma 1 ~ 2 , ( N/a ) 1_ “ 

= O (MaflV/a) 1 -") , 

i.e. the overall complexity is sub-linear in the number of observations for a > 0. 


Variance The tail of T is 


P[T>f] = — 2~ at (l + 2- Q + ...2“ a(L - t)N ) 

Z a ' ' 

1 i _ 2-«( z '-*+ 1 ) 

___ 2 ~ ai — _ Z. _ 

1 - 2~ a 

1 o— a(L — 1 + 1 ) 

= o-a(t-l) 1 - 2 _ 

1 - 2~ aL 

2~a(t-l) _ 2~ aL 

~ 1 - 2~ aL ' 

In addition to a, variance will also depend on the rate of convergence of partial posterior expectations to the 
full posterior expectation. We denote the difference between the expectation estimated on a partial posterior i r ( 
(corresponding to nt points) and the expectation estimated on the full posterior by 

6 t := E nt {ip(6)} - E^M#)}. 

Note that we have St = 0 almost surely for t > L as the sequence of estimators terminates with the full posterior. 
This suffices that for any finite N, variance of the debiasing scheme is finite as well. However, it might grow 
without bound as N grows large - which is an undesirable situation. Remarkably, it is possible to ensure that 
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variance is On( 1), he. it stays bounded as N increases. Namely, let us assume that for large enough IV, there 
exist a constant c and /? > 0, such that Vt < L: 

E ( |S ' |2 }^ = ^17- <7) 

The coefficient /? clearly depends on the function ip: it is typically 1 for simple models and ip(6) = 0 and could be 
closer to zero for complicated functionals ip exhibiting slow convergence of partial posterior expectations. Now, 
from Lemma 1, the second moment of the debiasing estimator is precisely 


E 



< 


< 


A EUft-rlVEfl&l 2 } 

^ P [T > t\ 

ST' ^{l^«-i| 2 } 

P [T > t] 


c2@ (1 - 2~ aL ) 


E 


i 

2 (<®—“)(*—!) — 2 d(.t-l)-aL ’ 


where the last sum remains finite for L —> oo as long as a < f). This implies that the variance of the scheme 
remains bounded by a constant as the number of observations N grows large. In terms of a and /3, this upper 
bound on the second moment is approximately f° r l ar S e 

Figure 10 shows fits of equation (7) to the convergence rates of the 8% for various of the presented examples. 
This shows that /3 can be chosen close to 1 or larger for simple, quickly converging posterior statistics. Note that 
values larger than 1 imply a constant average computational cost of debiasing (independently of the number of 
observations). However, note that these are empirical fits, sensitive to noise etc. In practice, it is possible to 
estimate /3 by investigating the desired expectations on the first few partial posteriors only - comparisons being 
w.r.t. the expectation given the largest batch considered among these, rather than w.r.t. the full posterior. We 
stress that getting an accurate estimate of /? is not required for our scheme - a conservative lower bound on f3 
suffices to ensure that the parameter a used in the truncation probabilities is smaller and thus variance remains 
bounded. 


Minimising asymptotic variance From (4), the variance x cost determines the asymptotic variance of the 
debiasing scheme. Thus, provided /3 is known, one can use the derivations above to select the parameter a in 
the stochastic truncation distribution, 


*' (l - 2““') 




N 


1—ol' 
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Logistic regression 
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Figure 10. Empirical least squares fits of an ^ in equation (7) to observed partial posterior statistics’ convergence. Top- 
left: Gaussian mean, from Figure 1. Top-right: Log-Gaussian standard deviation, from Figure 3. Bottom-left: First 
regression weight of synthetic logistic regression, from Figure 5. Bottom-right: A single prediction in approximate GP 
regression, from Figure 6. 











































































